Over the last two years, COVID-19 has been a main focus of the news. Yet although around 3% of the world population has had COVID-19, diabetes can be considered an even bigger health problem. According to the International Diabetes Federation (IDF), 463 million adults were living with diabetes in 2019 (around 6-7% of the world population), and this number is forecast to rise to 700 million by 2045. Furthermore, about 90% of diabetes cases are type 2, which results mainly from lifestyle habits rather than genetics. Type 2 diabetes can often be prevented or delayed, and both types are easier to manage, with a healthier diet and more physical activity. Additionally, according to the WHO, low-income countries tend to have a higher diabetes prevalence. Living in Europe, we observed that diabetes rates differ a lot from one country to another. We therefore wanted to find out whether these rates are indeed linked to a country’s income, and whether and how the nutritional composition of the diet in richer countries is affected by this income difference.
Therefore, we would like to answer the following questions:
Do European countries that have higher GDPs really have lower diabetes prevalence?
Do European countries that have higher GDPs consume fewer calories?
How do the proportions of macronutrients (animal protein/plant protein/fat/carbohydrates) consumed differ between richer and poorer countries?
And how do these differences relate to the diabetes prevalence in these countries? What is the typical diet observed in richer countries that relates to lower diabetes prevalence?
To answer our research questions, we used three different datasets. While searching for datasets, we made sure that the years and countries matched for every one of them.
The first dataset we used, downloaded from https://ourworldindata.org/diet-compositions, contains information on the supply of macronutrients, in calories, for different countries. We used food supply data rather than food consumption data, because the latter is harder to find and because supply generally reflects the population’s demand and therefore its food consumption. The dataset gives us information on the average nutrition of different countries from 1961 to 2013.
It is composed of 8981 observations of 7 variables:
| Variable | Description |
|---|---|
| Entity | Name of the country |
| Code | ISO country code |
| Year | Year of the observation |
| Calories from animal protein (FAO (2017)) | The average per capita supply of calories derived from animal protein, measured in kilocalories per person per day |
| Calories from plant protein (FAO (2017)) | The average per capita supply of calories derived from plant protein, measured in kilocalories per person per day |
| Calories from fat (FAO (2017)) | The average per capita supply of calories derived from fat, measured in kilocalories per person per day |
| Calories from carbohydrates (FAO (2017)) | The average per capita supply of calories derived from carbohydrates, measured in kilocalories per person per day |

The intake of specific macronutrients (carbohydrates, protein and fats) is derived from average food composition factors. These factors are derived and presented in the Food and Agriculture Organisation’s (FAO) Food Balance Sheet Handbook (https://www.fao.org/faostat/en/#data).
We will only focus on observations of European countries in the 2000s.
We used the ISO code as the key because it is standardized worldwide and, unlike country names, does not risk being spelled differently in different tables.
Then, we computed the mean consumption of each macronutrient in each country over the years 2000 to 2013.
We then created a new table by adding the sum of total calories per person per day for each country, both to answer our second research question and to get a broader view of overall calorie consumption. To make sure that the joining of tables goes smoothly, we also removed duplicates and the country name column.
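The aggregation described above can be sketched as follows. This is a minimal Python illustration rather than the R code used for the report; the rows, the country codes `AAA` and `BBB`, and the function name `mean_macros` are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows shaped like the diet dataset:
# (ISO code, year, animal protein, plant protein, fat, carbohydrates), in kcal/person/day.
rows = [
    ("AAA", 1999, 200.0, 150.0, 1000.0, 1700.0),  # outside 2000-2013, ignored
    ("AAA", 2000, 210.0, 150.0, 1000.0, 1700.0),
    ("AAA", 2001, 230.0, 170.0, 1100.0, 1900.0),
    ("BBB", 2000, 100.0, 200.0, 800.0, 1600.0),
]

def mean_macros(rows, first_year=2000, last_year=2013):
    """Average each macronutrient per country over the window, then add a total."""
    grouped = defaultdict(list)
    for code, year, *macros in rows:
        if first_year <= year <= last_year:
            grouped[code].append(macros)
    result = {}
    for code, values in grouped.items():
        # zip(*values) turns the list of rows into one tuple per macronutrient.
        means = [sum(col) / len(col) for col in zip(*values)]
        result[code] = {
            "cal_prot_animal": means[0],
            "cal_prot_plant": means[1],
            "cal_fat": means[2],
            "cal_carbs": means[3],
            "total_consumption": sum(means),
        }
    return result

summary = mean_macros(rows)
print(summary["AAA"])
```

Keying the result on the ISO code only, without the country name, mirrors the choice made above to simplify later joins.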
Our assumption was that a country’s wealth may fluctuate over the course of 10 years (e.g. a dip during the economic crisis of 2008), but that an overall mean is sufficient to compare the different countries and their wealth.
We now have a dataframe with the following variables :
| Variable | Description |
|---|---|
| country_code | ISO country code |
| cal_prot_animal | The mean of the calories from animal protein consumed per person in each country in the years 2000-2013 |
| cal_prot_plant | The mean of the calories from plant protein consumed per person in each country in the years 2000-2013 |
| cal_carbs | The mean of the calories from carbohydrates consumed per person in each country in the years 2000-2013 |
| cal_fat | The mean of the calories from fat consumed per person in each country in the years 2000-2013 |
| total_consumption | The total calorie consumption per person, based on the means of the consumption of each type of macronutrient in each country in the years 2000-2013 |

| Country Code | Calories from animal protein | Calories from plant protein | Calories from carbohydrates | Calories from fat | Total consumption |
|---|---|---|---|---|---|
| AUT | 245 | 169 | 1833 | 1454 | 3702 |
| BEL | 238 | 158 | 1856 | 1467 | 3719 |
| BGR | 155 | 168 | 1606 | 846 | 2775 |
| HRV | 168 | 148 | 1691 | 940 | 2946 |
| CYP | 197 | 127 | 1291 | 1019 | 2633 |
| CZE | 218 | 153 | 1728 | 1155 | 3254 |
| DNK | 273 | 157 | 1746 | 1190 | 3366 |
| EST | 212 | 167 | 1967 | 842 | 3188 |
| FIN | 269 | 166 | 1623 | 1177 | 3234 |
| FRA | 293 | 161 | 1611 | 1480 | 3545 |
| DEU | 240 | 159 | 1785 | 1276 | 3460 |
| GRC | 250 | 204 | 1744 | 1338 | 3536 |
| HUN | 193 | 153 | 1560 | 1221 | 3126 |
| IRL | 279 | 174 | 1950 | 1187 | 3590 |
| ITA | 241 | 204 | 1776 | 1390 | 3612 |
| LVA | 203 | 154 | 1687 | 1044 | 3087 |
| LTU | 274 | 198 | 2055 | 858 | 3385 |
| LUX | 288 | 148 | 1752 | 1318 | 3507 |
| MLT | 237 | 204 | 1932 | 994 | 3367 |
| NLD | 292 | 136 | 1599 | 1195 | 3222 |
| POL | 204 | 197 | 1969 | 1035 | 3405 |
| PRT | 275 | 177 | 1824 | 1240 | 3516 |
| ROU | 200 | 220 | 2003 | 916 | 3340 |
| SVK | 140 | 150 | 1610 | 952 | 2853 |
| SVN | 230 | 168 | 1664 | 1067 | 3129 |
| ESP | 279 | 159 | 1481 | 1323 | 3243 |
| SWE | 285 | 143 | 1566 | 1137 | 3131 |
| CHE | 237 | 138 | 1660 | 1392 | 3426 |
| GBR | 232 | 178 | 1748 | 1256 | 3414 |
Our second dataset, downloaded from the portal https://data.worldbank.org, gives us information about the GDP of many countries over the course of 60 years (1960-2020).
It is composed of 266 observations of 65 variables :
| Variable | Description |
|---|---|
| Country Name | Name of the country |
| Country Code | ISO country code |
| Indicator Name | Equal to “GDP in current US$” for every row |
| Indicator Code | Equal to “NY.GDP.MKTP.CD” for every row |

The remaining columns contain the GDP for each year from 1960 to 2020. RStudio imported the Excel file as is, so our column names ended up in the 3rd row, and columns 3 to 65 were given numbers as names.
We fixed that and kept only the years that interest us and that we have in common with the other tables, namely 2000-2013. We also dropped the Indicator Name and Indicator Code variables, since their values are the same for every row and provide no additional useful information.
Next, we kept only the European countries, just like in the first table.
To join tables easily, we transformed the columns corresponding to the different years into a single “year” column, so that each row of this dataset holds the GDP of one country in one year.
To make the data easier to manipulate, we renamed the variables of this table as well. We also made sure that our numeric variable (GDP) was stored as numeric and not as character, as it was by default. To get graphs that are easy to read in the exploratory data analysis, we also divided the GDP values by one billion.
Lastly, we computed the average GDP for each country over the years 2000-2013 in order to be able to plot different variables together.
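The reshaping and averaging steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the R code used for the report; the input rows, the codes `AAA` and `BBB`, and the helper names `to_long` and `average_gdp_billions` are hypothetical.

```python
# Hypothetical wide rows: one row per country, one text column per year (current US$).
wide = [
    {"Country Code": "AAA", "2000": "1.0e11", "2001": "3.0e11"},
    {"Country Code": "BBB", "2000": "4.0e10", "2001": "6.0e10"},
]

def to_long(wide_rows, years):
    """Turn year columns into (country_code, year, gdp) records, casting text to numbers."""
    long_rows = []
    for row in wide_rows:
        for year in years:
            long_rows.append((row["Country Code"], int(year), float(row[year])))
    return long_rows

def average_gdp_billions(long_rows):
    """Average GDP per country over all years present, in billions of US$."""
    totals, counts = {}, {}
    for code, _year, gdp in long_rows:
        totals[code] = totals.get(code, 0.0) + gdp
        counts[code] = counts.get(code, 0) + 1
    return {code: totals[code] / counts[code] / 1e9 for code in totals}

long_rows = to_long(wide, ["2000", "2001"])
avg_gdp = average_gdp_billions(long_rows)
print(avg_gdp)
```

The long format has one observation per row, which is what makes grouping by country and joining on the country code straightforward.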
| Country Name | Country Code | Average GDP (in billion $) |
|---|---|---|
| Austria | AUT | 335.98 |
| Belgium | BEL | 406.97 |
| Bulgaria | BGR | 37.41 |
| Croatia | HRV | 48.12 |
| Cyprus | CYP | 20.15 |
| Czech Republic | CZE | 158.48 |
| Denmark | DNK | 275.37 |
| Estonia | EST | 16.47 |
| Finland | FIN | 216.58 |
| France | FRA | 2283.48 |
| Germany | DEU | 3003.51 |
| Greece | GRC | 246.19 |
| Hungary | HUN | 110.92 |
| Ireland | IRL | 203.04 |
| Italy | ITA | 1872.80 |
| Latvia | LVA | 21.04 |
| Lithuania | LTU | 30.77 |
| Luxembourg | LUX | 42.85 |
| Malta | MLT | 7.27 |
| Netherlands | NLD | 721.70 |
| Poland | POL | 365.43 |
| Portugal | PRT | 200.37 |
| Romania | ROU | 125.08 |
| Slovak Republic | SVK | 70.82 |
| Slovenia | SVN | 39.50 |
| Spain | ESP | 1176.91 |
| Sweden | SWE | 425.83 |
| Switzerland | CHE | 490.18 |
| United Kingdom | GBR | 2416.76 |
We now have a dataframe with the following variables :
| Variable | Description |
|---|---|
| country_name | Name of the country |
| country_code | ISO code of the country |
| avg_gdp | The average GDP of a country over the course of 2000-2013 |

Since we will be observing the relation between the GDP and the calories consumed per person, it could be useful to have the GDP per person for the analysis. This is why we import an additional dataset from https://data.worldbank.org/indicator/SP.POP.TOTL, which gives us information on the evolution of the population per country over 1960-2020.
It is composed of 266 observations of 65 variables :
| Variable | Description |
|---|---|
| Country Name | Name of the country |
| Country Code | ISO country code |
| Indicator Name | Equal to “Population, total” for every row |
| Indicator Code | Equal to “SP.POP.TOTL” for every row |

As this dataset comes from the same source and has the same file type as the GDP dataset, we can proceed with the same wrangling.
In order to analyze the link between the GDP per person and calorie consumption per person, we will create a separate table which we will join to the final clean dataset.
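One way to derive a GDP per person figure is sketched below in Python. It assumes the yearly ratio of GDP to population is computed first and then averaged over the shared years; the report does not state whether this or a ratio of averages was used, and the numbers and the function name `gdp_per_person` are hypothetical.

```python
def gdp_per_person(gdp_by_year, pop_by_year):
    """Mean, over the years present in both inputs, of GDP (US$) divided by population."""
    years = sorted(set(gdp_by_year) & set(pop_by_year))
    ratios = [gdp_by_year[year] / pop_by_year[year] for year in years]
    return sum(ratios) / len(ratios)

# Hypothetical country: GDP in current US$ and total population for two years.
gdp = {2000: 2.0e11, 2001: 3.0e11}
pop = {2000: 8.0e6, 2001: 1.0e7}

value = gdp_per_person(gdp, pop)
print(round(value))
```

Restricting the computation to years present in both series avoids silently dividing by a population from a different year.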
We now have a dataframe with the following variables :
| Variable | Description |
|---|---|
| country_name | Name of the country |
| country_code | ISO code of the country |
| gdp_per_person | The average GDP per person of a country over the course of 2000-2013 |

The third dataset, downloaded from https://www.ncdrisc.org/data-downloads-diabetes.html, gives us information about the age-standardised diabetes prevalence for each country and gender from 1980 to 2014.
It is composed of 14,000 observations of 7 variables:

| Variable | Description |
|---|---|
| Country/Region/World | Name of the country |
| ISO | ISO country code |
| Sex | Gender for which the diabetes prevalence is measured in a certain country at a certain year |
| Year | Year of observation (1980-2014) |
| Age-standardised diabetes prevalence | Diabetes rate considering all ages |
| Lower 95% uncertainty interval | Lower limit of the uncertainty interval for the diabetes rate |
| Upper 95% uncertainty interval | Upper limit of the uncertainty interval for the diabetes rate |

As with the first 2 datasets, we filtered our data to keep only European countries and observations between the years 2000 and 2013.
We also decided not to use the 95% uncertainty interval variables.
Then, we separated our dataset into two subsets: one with data about men and another with data about women.
We then renamed the variables of these 2 subsets to make joining the tables easier later on.
Finally, we grouped observations by country to get the mean diabetes prevalence between 2000 and 2013 for each European country:
| Country Code | Diabetes rate (men) |
|---|---|
| AUT | 0.053 |
| BEL | 0.057 |
| BGR | 0.073 |
| CHE | 0.050 |
| CYP | 0.077 |
| CZE | 0.078 |
| DEU | 0.059 |
| DNK | 0.055 |
| ESP | 0.084 |
| EST | 0.071 |
| FIN | 0.066 |
| FRA | 0.071 |
| GBR | 0.063 |
| GRC | 0.069 |
| HRV | 0.071 |
| HUN | 0.080 |
| IRL | 0.069 |
| ITA | 0.065 |
| LTU | 0.078 |
| LUX | 0.068 |
| LVA | 0.071 |
| MLT | 0.088 |
| NLD | 0.052 |
| POL | 0.074 |
| PRT | 0.075 |
| ROU | 0.062 |
| SVK | 0.072 |
| SVN | 0.066 |
| SWE | 0.058 |
We now have 2 dataframes with the following variables :
| Variable | Description |
|---|---|
| country_code | ISO code of the country |
| prop_men_diabetes or prop_women_diabetes | The average diabetes rate in a country in the 2000-2013 timeframe |

For the last step of our tidying, we joined all four tables into one dataset using the country_code key:
| Country Name | Country Code | Average GDP (in billion $) | GDP per person (in $) | Men Diabetes | Women Diabetes | Calories from animal protein | Calories from plant protein | Calories from carbohydrates | Calories from fat | Total consumption |
|---|---|---|---|---|---|---|---|---|---|---|
| Austria | AUT | 335.98 | 40545 | 0.053 | 0.034 | 245 | 169 | 1833 | 1454 | 3702 |
| Belgium | BEL | 406.97 | 38015 | 0.057 | 0.039 | 238 | 158 | 1856 | 1467 | 3719 |
| Bulgaria | BGR | 37.41 | 4988 | 0.073 | 0.064 | 155 | 168 | 1606 | 846 | 2775 |
| Croatia | HRV | 48.12 | 11189 | 0.071 | 0.059 | 168 | 148 | 1691 | 940 | 2946 |
| Cyprus | CYP | 20.15 | 18890 | 0.077 | 0.056 | 197 | 127 | 1291 | 1019 | 2633 |
| Czech Republic | CZE | 158.48 | 15283 | 0.078 | 0.065 | 218 | 153 | 1728 | 1155 | 3254 |
| Denmark | DNK | 275.37 | 50213 | 0.055 | 0.035 | 273 | 157 | 1746 | 1190 | 3366 |
| Estonia | EST | 16.47 | 12283 | 0.071 | 0.064 | 212 | 167 | 1967 | 842 | 3188 |
| Finland | FIN | 216.58 | 40806 | 0.066 | 0.044 | 269 | 166 | 1623 | 1177 | 3234 |
| France | FRA | 2283.48 | 35700 | 0.071 | 0.044 | 293 | 161 | 1611 | 1480 | 3545 |
| Germany | DEU | 3003.51 | 36733 | 0.059 | 0.040 | 240 | 159 | 1785 | 1276 | 3460 |
| Greece | GRC | 246.19 | 22344 | 0.069 | 0.060 | 250 | 204 | 1744 | 1338 | 3536 |
| Hungary | HUN | 110.92 | 11050 | 0.080 | 0.063 | 193 | 153 | 1560 | 1221 | 3126 |
| Ireland | IRL | 203.04 | 46899 | 0.069 | 0.049 | 279 | 174 | 1950 | 1187 | 3590 |
| Italy | ITA | 1872.80 | 32000 | 0.065 | 0.047 | 241 | 204 | 1776 | 1390 | 3612 |
| Latvia | LVA | 21.04 | 9774 | 0.071 | 0.065 | 203 | 154 | 1687 | 1044 | 3087 |
| Lithuania | LTU | 30.77 | 9697 | 0.078 | 0.069 | 274 | 198 | 2055 | 858 | 3385 |
| Luxembourg | LUX | 42.85 | 87546 | 0.068 | 0.039 | 288 | 148 | 1752 | 1318 | 3507 |
| Malta | MLT | 7.27 | 17759 | 0.088 | 0.066 | 237 | 204 | 1932 | 994 | 3367 |
| Netherlands | NLD | 721.70 | 43883 | 0.052 | 0.037 | 292 | 136 | 1599 | 1195 | 3222 |
| Poland | POL | 365.43 | 9586 | 0.074 | 0.066 | 204 | 197 | 1969 | 1035 | 3405 |
| Portugal | PRT | 200.37 | 19077 | 0.075 | 0.052 | 275 | 177 | 1824 | 1240 | 3516 |
| Romania | ROU | 125.08 | 6062 | 0.062 | 0.059 | 200 | 220 | 2003 | 916 | 3340 |
| Slovak Republic | SVK | 70.82 | 13146 | 0.072 | 0.059 | 140 | 150 | 1610 | 952 | 2853 |
| Slovenia | SVN | 39.50 | 19502 | 0.066 | 0.065 | 230 | 168 | 1664 | 1067 | 3129 |
| Spain | ESP | 1176.91 | 26263 | 0.084 | 0.059 | 279 | 159 | 1481 | 1323 | 3243 |
| Sweden | SWE | 425.83 | 46181 | 0.058 | 0.040 | 285 | 143 | 1566 | 1137 | 3131 |
| Switzerland | CHE | 490.18 | 64036 | 0.050 | 0.030 | 237 | 138 | 1660 | 1392 | 3426 |
| United Kingdom | GBR | 2416.76 | 39338 | 0.063 | 0.049 | 232 | 178 | 1748 | 1256 | 3414 |
We did not have any NA values in our tables. We think this is because we spent time gathering quality data that matched in terms of dates and countries.
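The join on the country code, together with a check for keys that would produce missing values, can be sketched as follows. This is a minimal Python illustration, not the R code used for the report. The Austria and Belgium values are taken from the tables above; the code `XXX` and the helper names `inner_join` and `unmatched_codes` are hypothetical.

```python
def inner_join(tables):
    """Join dicts keyed by country code; keep only codes present in every table."""
    shared = set(tables[0]).intersection(*tables[1:])
    joined = {}
    for code in sorted(shared):
        row = {"country_code": code}
        for table in tables:
            row.update(table[code])
        joined[code] = row
    return joined

def unmatched_codes(tables):
    """Codes that appear in at least one table but not in all of them."""
    every = set().union(*tables)
    shared = set(tables[0]).intersection(*tables[1:])
    return sorted(every - shared)

gdp = {"AUT": {"avg_gdp": 335.98}, "BEL": {"avg_gdp": 406.97}}
men = {"AUT": {"prop_men_diabetes": 0.053}, "BEL": {"prop_men_diabetes": 0.057}}
# "XXX" is a made-up code that exists in one table only.
diet = {
    "AUT": {"total_consumption": 3702},
    "BEL": {"total_consumption": 3719},
    "XXX": {"total_consumption": 3000},
}

clean = inner_join([gdp, men, diet])
print(clean["AUT"])
print(unmatched_codes([gdp, men, diet]))
```

An empty result from `unmatched_codes` corresponds to the situation described above, where every country appears in every table and the join introduces no missing values.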
First, even though we will be taking the means of the variables with which we are trying to answer our questions, it is interesting to observe their evolution in each country over time. We started with the GDP.
We can see that the GDP of France, Germany, Italy, Spain and the United Kingdom had a significant increase between 2000 and 2008.
Now let’s see if there is a relation between the GDP of a country and its diabetes prevalence. (men = blue, women = red)
We observe that, apart from 5 outliers, our observations are mostly bunched up at the left of the graph. We decided to exclude these 5 observations to see whether a trend appears among the other countries. These outliers, as we can see on the previous graph, are the countries that had a big increase in GDP over the period 2000-2013.
Without the outliers, the picture is a bit clearer: it seems that the richer a country is, the lower the diabetes rate among its population.
For the second table, we again looked for a trend in the consumption of the different macronutrients in the 2000s for each country in our sample.
Across countries, one difference stands out and seems to be related to wealth: countries with a higher GDP, like Austria, consume more fat on average, as can be seen on this graph:
In contrast, countries with a lower GDP, like Bulgaria, have a lower fat consumption, as seen below:
There do not seem to be any trends over time in the graphs above, and diets seem rather stable in each country, which is why we take the average consumption of each macronutrient for our analysis. We can however note that the 5 outliers mentioned before tend to have a higher fat consumption than the countries with a smaller GDP.
We then wanted to analyse the relation between a country’s GDP and its individual consumption of each macronutrient, as well as its total calorie consumption, to see if there is a trend. (total calories = orange, fat = blue, carbohydrates = purple, animal protein = red, plant protein = green)
We see that calorie consumption does not really change with GDP. We wanted a close-up of the relation between total calorie consumption and GDP for each country to see if we could spot outliers again, so we created other plots.
We end up again with these 5 outliers that have a higher-than-average GDP. If we remove them, we obtain the following plots:
Now a trend is easier to see: it appears that the higher a country’s GDP, the higher the total calories consumed, contrary to our hypothesis.
Once again, we tried to see if the diabetes prevalence in each country changed over the years 2000-2013.
We saw right away that the prevalence of diabetes is higher for men than for women across all countries (with two exceptions: Romania between 2000 and 2003 and Slovenia between 2000 and 2006).
We observed three different scenarios among the countries we selected. The first is a decrease in diabetes over time.
We take Belgium as an example:
The second is a decrease over time for women but not for men.
We take Austria as an example:
In the other European countries, the prevalence of diabetes is increasing (at different paces) over time.
We take Croatia as an example:
Finally, we want to plot the relation between the diabetes prevalence against the total calorie consumption as well as each type of macronutrient consumed.
We can see a negative trend for the total consumption, the calories from animal protein and the calories from fat, and a positive trend for the calories from plant protein. For calories from carbohydrates, we can see a slightly positive trend for women.
Now, since they affected our plots that included the GDP variable so much, we want to see if we have different trends when we remove our 5 outliers.
Without our 5 outliers, the trends barely change for each type of calories consumed, except for carbohydrates, where the trend for men changes and becomes slightly positive.
This first question serves more as a control, since we learned during our research prior to our project that countries with higher GDPs tend to have lower diabetes rates. Indeed, we can observe that in the EDA.
It is important to note that, when we fit a linear model on these variables and compute correlations over all observations, the relationships are weak: the coefficient is only borderline significant for women (p = 0.049) and not significant for men (p = 0.217).
| Between average GDP and women diabetes rate | -0.369 |
| Between average GDP and men diabetes rate | -0.236 |
| Average GDP vs Women diabetes rate | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 1847.35 | 654.70 | 2.82 | 0.009 |
| prop women diabetes | -25161.76 | 12203.46 | -2.06 | 0.049 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.136 / 0.104 | |||
| Average GDP vs Men diabetes rate | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 1890.37 | 1086.48 | 1.74 | 0.093 |
| prop men diabetes | -19988.04 | 15812.10 | -1.26 | 0.217 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.056 / 0.021 | |||
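The correlation and regression figures in this section can be approximated with a few lines of Python. The sketch below is not the R code behind the tables: it uses the rounded average GDP (in billion $) and women's diabetes rates from the joined table, so the results should land close to the reported values without matching them exactly. The function names `pearson_r` and `simple_ols` are assumptions of this example.

```python
from math import sqrt

# Rounded values copied from the joined table, in the same row order (AUT ... GBR).
avg_gdp = [335.98, 406.97, 37.41, 48.12, 20.15, 158.48, 275.37, 16.47, 216.58,
           2283.48, 3003.51, 246.19, 110.92, 203.04, 1872.80, 21.04, 30.77,
           42.85, 7.27, 721.70, 365.43, 200.37, 125.08, 70.82, 39.50, 1176.91,
           425.83, 490.18, 2416.76]
women = [0.034, 0.039, 0.064, 0.059, 0.056, 0.065, 0.035, 0.064, 0.044,
         0.044, 0.040, 0.060, 0.063, 0.049, 0.047, 0.065, 0.069,
         0.039, 0.066, 0.037, 0.066, 0.052, 0.059, 0.059, 0.065, 0.059,
         0.040, 0.030, 0.049]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

# The tables use GDP as the response and the diabetes rate as the predictor.
r = pearson_r(women, avg_gdp)
intercept, slope = simple_ols(women, avg_gdp)
r_squared = r ** 2  # for a one-predictor regression, R2 equals the squared correlation
print(round(r, 3), round(slope), round(r_squared, 3))
```

Because R2 is simply the squared correlation in a one-predictor model, the correlation tables and the R2 rows of the regression tables carry the same information.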
However, once we exclude the outliers, the relationship becomes much stronger and clearly significant:
| Between average GDP and women diabetes rate | -0.696 |
| Between average GDP and men diabetes rate | -0.739 |
| Average GDP vs Women diabetes rate (without outliers) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 741.51 | 123.89 | 5.99 | <0.001 |
| prop women diabetes | -10299.65 | 2263.83 | -4.55 | <0.001 |
| Observations | 24 | |||
| R2 / R2 adjusted | 0.485 / 0.461 | |||
| Average GDP vs Men diabetes rate (without outliers) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 1147.62 | 187.48 | 6.12 | <0.001 |
| prop men diabetes | -14048.33 | 2730.12 | -5.15 | <0.001 |
| Observations | 24 | |||
| R2 / R2 adjusted | 0.546 / 0.526 | |||
In the EDA section, we grouped countries in 3 categories, according to the relationship between the GDP and diabetes rate. Here, we confirmed statistically that a relationship exists between these 2 variables when we remove outliers. This therefore made us think that these countries could be categorized into clusters.
To determine the number of clusters we used the elbow method. This method examines the percentage of variance explained as a function of the number of clusters. It is based on the idea that a number of clusters should be chosen such that the addition of another cluster does not allow for a better modeling of the data. The percentage of variance explained by the clusters is plotted against the number of clusters.
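The elbow method can be illustrated with a small, self-contained Python sketch. It is not the R code used for the report: the points are made-up two-dimensional data with three obvious groups, and the k-means routine uses a deterministic farthest-point initialisation, chosen here only to make the example reproducible. The function names are assumptions of this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def centroid(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))

def kmeans_wcss(points, k, iterations=50):
    """Lloyd's algorithm; returns the within-cluster sum of squares for k clusters."""
    centers = [points[0]]
    while len(centers) < k:
        # Farthest-point initialisation: add the point farthest from its nearest center.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [centroid(c) if c else centers[i] for i, c in enumerate(clusters)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def explained_variance(points, max_k):
    """Share of total variance explained by the clustering, for k = 1..max_k."""
    total = kmeans_wcss(points, 1)
    return [1 - kmeans_wcss(points, k) / total for k in range(1, max_k + 1)]

# Hypothetical data: three well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (20, 1), (20, 2), (21, 1), (21, 2)]

explained = explained_variance(points, 5)
for k, share in enumerate(explained, start=1):
    print(k, round(share, 3))
```

On such data the explained variance jumps sharply up to k = 3 and then flattens, which is the "elbow" used to pick the number of clusters.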
We therefore see from the graph above that the optimal number of clusters is 3. The allocation of countries according to their cluster is therefore as follows:
| country_code | avg_gdp | gdp_per_person | prop_men_diabetes | prop_women_diabetes | cal_prot_animal | cal_prot_plant | cal_carbs | cal_fat | total_consumption | cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| AUT | 336 | 40545 | 0.053 | 0.034 | 245 | 169 | 1833 | 1454 | 3702 | 1 |
| BEL | 407 | 38015 | 0.057 | 0.039 | 238 | 158 | 1856 | 1467 | 3719 | 1 |
| DNK | 275 | 50213 | 0.055 | 0.035 | 273 | 157 | 1746 | 1190 | 3366 | 1 |
| FIN | 217 | 40806 | 0.066 | 0.044 | 269 | 166 | 1623 | 1177 | 3234 | 1 |
| FRA | 2283 | 35700 | 0.071 | 0.044 | 293 | 161 | 1611 | 1480 | 3545 | 1 |
| DEU | 3004 | 36733 | 0.059 | 0.040 | 240 | 159 | 1785 | 1276 | 3460 | 1 |
| IRL | 203 | 46899 | 0.069 | 0.049 | 279 | 174 | 1950 | 1187 | 3590 | 1 |
| ITA | 1873 | 32000 | 0.065 | 0.047 | 241 | 204 | 1776 | 1390 | 3612 | 1 |
| NLD | 722 | 43883 | 0.052 | 0.037 | 292 | 136 | 1599 | 1195 | 3222 | 1 |
| SWE | 426 | 46181 | 0.058 | 0.040 | 285 | 143 | 1566 | 1137 | 3131 | 1 |
| GBR | 2417 | 39338 | 0.063 | 0.049 | 232 | 178 | 1748 | 1256 | 3414 | 1 |
| country_code | avg_gdp | gdp_per_person | prop_men_diabetes | prop_women_diabetes | cal_prot_animal | cal_prot_plant | cal_carbs | cal_fat | total_consumption | cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| BGR | 37.41 | 4988 | 0.073 | 0.064 | 155 | 168 | 1606 | 846 | 2775 | 2 |
| HRV | 48.12 | 11189 | 0.071 | 0.059 | 168 | 148 | 1691 | 940 | 2946 | 2 |
| CYP | 20.15 | 18890 | 0.077 | 0.056 | 197 | 127 | 1291 | 1019 | 2633 | 2 |
| CZE | 158.48 | 15283 | 0.078 | 0.065 | 218 | 153 | 1728 | 1155 | 3254 | 2 |
| EST | 16.47 | 12283 | 0.071 | 0.064 | 212 | 167 | 1967 | 842 | 3188 | 2 |
| GRC | 246.19 | 22344 | 0.069 | 0.060 | 250 | 204 | 1744 | 1338 | 3536 | 2 |
| HUN | 110.92 | 11050 | 0.080 | 0.063 | 193 | 153 | 1560 | 1221 | 3126 | 2 |
| LVA | 21.04 | 9774 | 0.071 | 0.065 | 203 | 154 | 1687 | 1044 | 3087 | 2 |
| LTU | 30.77 | 9697 | 0.078 | 0.069 | 274 | 198 | 2055 | 858 | 3385 | 2 |
| MLT | 7.27 | 17759 | 0.088 | 0.066 | 237 | 204 | 1932 | 994 | 3367 | 2 |
| POL | 365.43 | 9586 | 0.074 | 0.066 | 204 | 197 | 1969 | 1035 | 3405 | 2 |
| PRT | 200.37 | 19077 | 0.075 | 0.052 | 275 | 177 | 1824 | 1240 | 3516 | 2 |
| ROU | 125.08 | 6062 | 0.062 | 0.059 | 200 | 220 | 2003 | 916 | 3340 | 2 |
| SVK | 70.82 | 13146 | 0.072 | 0.059 | 140 | 150 | 1610 | 952 | 2853 | 2 |
| SVN | 39.50 | 19502 | 0.066 | 0.065 | 230 | 168 | 1664 | 1067 | 3129 | 2 |
| ESP | 1176.91 | 26263 | 0.084 | 0.059 | 279 | 159 | 1481 | 1323 | 3243 | 2 |
| country_code | avg_gdp | gdp_per_person | prop_men_diabetes | prop_women_diabetes | cal_prot_animal | cal_prot_plant | cal_carbs | cal_fat | total_consumption | cluster |
|---|---|---|---|---|---|---|---|---|---|---|
| LUX | 42.9 | 87546 | 0.068 | 0.039 | 288 | 148 | 1752 | 1318 | 3507 | 3 |
| CHE | 490.2 | 64036 | 0.050 | 0.030 | 237 | 138 | 1660 | 1392 | 3426 | 3 |
It is important to note that our 3rd cluster only contains 2 countries, and therefore when we analyse correlation and linear regression parameters, we will be mainly considering the overall nature of the relationships. (positive or negative link?)
To go further, we decided to represent the clusters on a map to see if there was a difference between the north and the south, for example.
The first cluster is therefore located more in the centre and north of Europe.
The second cluster is located mostly in the south and east of Europe.
The third cluster contains only two countries, so it is hard to assign it a region.
Now let’s plot these clusters to see the differences between them.
In the graph above, we can see that diabetes is indeed lower in the clusters that contain the countries with higher GDPs. Let’s see if this relationship is significant within each cluster:
| Between average GDP and women diabetes rate | 0.368 |
| Between average GDP and men diabetes rate | 0.297 |
| Average GDP vs Women diabetes rate (cluster 1) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -1949.02 | 2590.14 | -0.75 | 0.471 |
| prop women diabetes | 73395.77 | 61776.03 | 1.19 | 0.265 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.136 / 0.040 | |||
| Average GDP vs Men diabetes rate (cluster 1) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -1823.00 | 3150.00 | -0.58 | 0.577 |
| prop men diabetes | 48218.14 | 51590.68 | 0.93 | 0.374 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.088 / -0.013 | |||
| Between average GDP and women diabetes rate | -0.235 |
| Between average GDP and men diabetes rate | 0.325 |
| Average GDP vs Women diabetes (cluster 2) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 1113.78 | 1048.47 | 1.06 | 0.306 |
| prop women diabetes | -15283.41 | 16887.89 | -0.90 | 0.381 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.055 / -0.012 | |||
| Average GDP vs Men diabetes (cluster 2) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -901.34 | 834.85 | -1.08 | 0.299 |
| prop men diabetes | 14396.62 | 11208.38 | 1.28 | 0.220 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.105 / 0.042 | |||
| Between average GDP and women diabetes rate | -1 |
| Between average GDP and men diabetes rate | -1 |
| Average GDP vs Women diabetes (cluster 3) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 1920.92 | NaN | NaN | NaN |
| prop women diabetes | -47597.62 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
| Average GDP vs Men diabetes (cluster 3) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 1698.19 | NaN | NaN | NaN |
| prop men diabetes | -24247.81 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
Clustering here doesn’t necessarily help us answer the research question. We can confirm with the previous statistical test without outliers that overall, the GDP of a country and its diabetes rate are negatively correlated.
However, the clusters defined above could help us answer our other research questions.
As mentioned in the first point, countries with a higher GDP tend to have a lower diabetes rate, which could potentially be explained by the consumption of fewer calories.
But is there a real correlation between GDP and total calorie consumption? Let’s check:
| Between average GDP and total calories | 0.355 |
| Average GDP vs total consumption | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -3017.67 | 1806.15 | -1.67 | 0.106 |
| total consumption | 1.07 | 0.55 | 1.97 | 0.059 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.126 / 0.093 | |||
Neither the correlation between these two variables nor the linear regression is significant. However, it would be interesting to look further. First, we can see if using the GDP per person instead of the total average makes a difference.
The plot looks more or less the same as the one in the EDA section without the outliers. But is the relationship with this new variable statistically significant ?
| Between GDP per person and total calories | 0.458 |
| GDP per person vs total consumption | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -80976.25 | 41007.78 | -1.97 | 0.059 |
| total consumption | 33.19 | 12.39 | 2.68 | 0.012 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.210 / 0.181 | |||
We see that the correlation is higher with the variable gdp_per_person, and the parameter in the regression is now significant at the 5% level (p = 0.012), even though the model still explains only about 21% of the variance.
Next we can proceed with an analysis within clusters, with the ones defined in the first question.
| Between GDP per person and total calories | -0.469 |
| GDP per person vs total consumption (cluster 1) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 84886.45 | 27655.84 | 3.07 | 0.013 |
| total consumption | -12.72 | 7.99 | -1.59 | 0.146 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.220 / 0.133 | |||
| Between GDP per person and total calories | 0.224 |
| GDP per person vs total consumption (cluster 2) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -1941.55 | 18811.00 | -0.10 | 0.919 |
| total consumption | 5.08 | 5.91 | 0.86 | 0.404 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.050 / -0.018 | |||
| Between GDP per person and total calories | 1 |
| GDP per person vs total consumption (cluster 3) | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -939315.96 | NaN | NaN | NaN |
| total consumption | 292.84 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
The results indicate that even within the clusters, these variables are not significantly correlated, and the total calories consumed per person in a country are not a significant indicator of this country’s wealth.
When we look at our cluster plot, in terms of total calorie intake it is also the 1st cluster, which contains the richest countries considering GDP per capita, that consumes the most calories. One might therefore think that calorie consumption is not the main reason why high-GDP countries have lower diabetes rates.
We observed during the EDA that richer countries seemed to consume more fat on average. Now we want to see if we can confirm this relationship, and find out if there is a correlation between the average GDP of a country and the calories consumed for other macronutrients too.
Since the variables “calories consumed from animal protein/plant protein/carbs/fat” are expressed on a per capita basis, we think that including the GDP per capita in this part of the analysis could improve results. Additionally, we thought that another way of answering this research question could be to use the proportion of the total calories that comes from each macronutrient and see if it affects our analysis.
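The proportion metric can be sketched as follows. This is a minimal Python illustration, not the code behind our results: the function name `macro_proportions`, the key names and the values are hypothetical, and it assumes that the total is the sum of the four macronutrient calorie counts.

```python
def macro_proportions(row):
    """Return each macronutrient's share of the total calories for one country."""
    macros = ["cal_prot_animal", "cal_prot_plant", "cal_carbs", "cal_fat"]
    # Assumption: total calories is the sum of the four macronutrient columns.
    total = sum(row[m] for m in macros)
    return {m.replace("cal_", "proportion_"): row[m] / total for m in macros}

# Invented daily per-capita supply (kcal) for a single hypothetical country.
country = {"cal_prot_animal": 250, "cal_prot_plant": 170, "cal_carbs": 1700, "cal_fat": 1280}
shares = macro_proportions(country)
print(shares)
```

Because the shares sum to 1, they are not independent of each other: a larger share for one macronutrient necessarily means a smaller combined share for the others.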
With these new metrics introduced, we will plot these relationships again.
It is also important to plot the clusters again to observe differences in the proportions of calories consumed.
We will be referring to these graphs and this cluster plot in our analysis in the next sections.
Let’s start by analysing the relationship between a country’s wealth and its consumption of animal protein. We saw in the EDA section that there didn’t seem to be a trend between these 2 variables: overall, the calories consumed from animal protein are very stable regardless of the average GDP of a country. We can confirm this by checking the correlation and by trying to fit a linear regression.
| Between average GDP and calories from animal protein | 0.272 |
| Average GDP vs calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -716.41 | 860.87 | -0.83 | 0.413 |
| cal prot animal | 5.28 | 3.59 | 1.47 | 0.153 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.074 / 0.040 | |||
Even though the correlation is positive, it is not very high. By looking at the summary of the linear regression, we can also see that the parameter is not significant. But what if we took the GDP per capita instead of the average GDP of a country ? The graph above seems to show again that the consumption of this macronutrient is very stable regardless of the GDP per capita of a country.
| Between GDP per person and calories from animal protein | 0.637 |
| Between GDP per person and calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -41157.10 | 16464.75 | -2.50 | 0.019 |
| cal prot animal | 295.34 | 68.69 | 4.30 | <0.001 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.406 / 0.384 | |||
This time, the correlation is higher and the variable’s estimate as a parameter is very significant ! Maybe the correlation can improve by considering proportion of calories instead of the calorie count.
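As a quick sanity check on the tables above, two identities hold for a linear regression with a single predictor: R2 equals the squared Pearson correlation, and the t statistic equals the estimate divided by its standard error. The short Python check below uses only the rounded figures reported for GDP per person against calories from animal protein, so small rounding differences are to be expected.

```python
# Rounded values copied from the correlation and regression tables above.
r = 0.637           # correlation between GDP per person and calories from animal protein
estimate = 295.34   # slope estimate for "cal prot animal"
std_error = 68.69   # its standard error

r_squared = r ** 2
t_statistic = estimate / std_error

print(f"R2 from correlation: {r_squared:.3f}")      # reported R2: 0.406
print(f"t from estimate / SE: {t_statistic:.2f}")   # reported statistic: 4.30
```

The same check can be applied to any of the other single-predictor tables in this section.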
| Between GDP per person and proportion of calories from animal protein | 0.536 |
| Between GDP per person and proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -41451.02 | 21463.09 | -1.93 | 0.064 |
| proportion animal prot | 981450.06 | 297643.37 | 3.30 | 0.003 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.287 / 0.261 | |||
Here, the correlation is slightly lower than with the calorie count and the parameter’s estimate is a bit less significant but still quite solid. Now let’s see if we can find this positive relationship and better explanations within our defined clusters.
From our cluster plot above, we see that high GDP countries (clusters 1 and 3) tend to consume more calories of animal protein than low GDP countries, this is also the case for the proportions of calories consumed from animal protein. This follows what we have found so far. Are the correlations and linear regression in line with these observations ?
| Between GDP per person and proportion of calories from animal protein | 0.566 |
| Between GDP per person and proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 17379.15 | 11526.77 | 1.51 | 0.166 |
| proportion animal prot | 308366.99 | 149728.86 | 2.06 | 0.070 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.320 / 0.245 | |||
| Between GDP per person and proportion of calories from animal protein | 0.676 |
| Between GDP per person and proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -13068.03 | 8010.86 | -1.63 | 0.125 |
| proportion animal prot | 404664.90 | 117758.56 | 3.44 | 0.004 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.458 / 0.419 | |||
| Between GDP per person and proportion of calories from animal protein | 1 |
| Between GDP per person and proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -61499.76 | NaN | NaN | NaN |
| proportion animal prot | 1812514.02 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
For clusters 1 and 2, the correlation is positive but still not very strong (0.566 and 0.676), and only the estimate of the parameter for cluster 2 is significant (p = 0.004).
In general, though it seems positive, we cannot conclude that there is a strong link between the wealth of a country and its animal protein consumption.
Now let’s move on to the analysis of the relationship between a country’s wealth and its consumption of plant protein. We saw in the EDA section, again, that there didn’t seem to be a trend between these 2 variables. The calories consumed from plant protein are very stable, as we can see from the horizontal line, regardless of the average GDP of a country. To confirm this, we will check the correlation.
| Between average GDP and calories from plant protein | 0.043 |
| Between average GDP and calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 276.48 | 1135.65 | 0.24 | 0.809 |
| cal prot plant | 1.52 | 6.75 | 0.23 | 0.823 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.002 / -0.035 | |||
As we can see, the correlation is very low and when we try to fit a linear model over these 2 variables, the parameter estimation is not significant at all. Let’s see if taking the GDP per capita makes a difference. In the graph above, the consumption of this macronutrient is very stable too.
| Between GDP per person and calories from plant protein | -0.357 |
| Between GDP per person and calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 78466.08 | 25364.62 | 3.09 | 0.005 |
| cal prot plant | -299.07 | 150.66 | -1.99 | 0.057 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.127 / 0.095 | |||
The correlation improves when we consider the GDP per capita instead of the average. We also see that the relationship between the 2 variables is rather negative. However the parameter estimation is still not significant enough, even though it improved. Can we find a better explanation by using the proportion of calories rather than the calorie count of the plant protein consumed ?
| Between GDP per person and proportion of calories from plant protein | -0.683 |
| Between GDP per person and proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 135049.20 | 22065.03 | 6.12 | <0.001 |
| proportion plant prot | -2103072.30 | 432611.97 | -4.86 | <0.001 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.467 / 0.447 | |||
The correlation improved immensely ! Now we can confirm that the relationship is negative, even though the correlation is moderate rather than strong. And now our parameter estimate is highly significant (p < 0.001). How does this translate to our clusters ?
The cluster plot above indicates that in countries with higher GDPs like in cluster 1 and 3, the proportions of calories consumed from plant protein are lower than in countries with lower GDPs. Let’s see if we can confirm these observations and see if these relationships are strong enough within each cluster.
| Between GDP per person and proportion of calories from plant protein | -0.351 |
| Between GDP per person and proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 61779.35 | 18585.68 | 3.32 | 0.009 |
| proportion plant prot | -438723.62 | 389754.96 | -1.13 | 0.289 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.123 / 0.026 | |||
| Between GDP per person and proportion of calories from plant protein | -0.417 |
| Between GDP per person and proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 38383.00 | 14155.73 | 2.71 | 0.017 |
| proportion plant prot | -448628.20 | 261104.64 | -1.72 | 0.108 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.174 / 0.115 | |||
| Between GDP per person and proportion of calories from plant protein | 1 |
| Between GDP per person and proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -398204.08 | NaN | NaN | NaN |
| proportion plant prot | 11498699.32 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
We see that the correlations within clusters are weaker. And actually, even among the countries with lower GDPs that belong to cluster 2, the relationship is rather negative. This tells us that visual interpretation is not enough for analysis. One possible reading is that the wealthier countries of cluster 2 may consume a larger proportion of plant protein than the wealthy countries of clusters 1 and 3, but they still consume less of this macronutrient than the poorer countries of their own cluster.
Therefore, we cannot conclude that there is a significant link between the wealth of a country and its plant protein consumption, although this relationship seems negative.
Next, we will evaluate the relationship between a country’s GDP and its carbohydrate consumption. In the EDA graph, there isn’t really a general trend between these 2 variables. The calories consumed from carbs are more or less stable for countries with an average GDP of around 500 billion dollars and more. Let’s check the correlation.
| Between average GDP and calories from carbs | -0.083 |
| Between average GDP and calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 1212.64 | 1588.16 | 0.76 | 0.452 |
| cal carbs | -0.39 | 0.91 | -0.43 | 0.669 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.007 / -0.030 | |||
The correlation is negative and very low. This could maybe be explained by the fact that the variance is very high in the left part of the graph, where the average GDP is lower, up until 500 billion dollars. The parameter estimate is not significant either, so this regression is not appropriate to explain this relationship.
But what if we compared the 2 metrics at the same unit level, like in the previous 2 sections ? Our graph above indicates that the caloric intake of carbs per capita may decrease slightly as a country’s GDP per capita increases. Let’s see if this is true.
| Between GDP per person and calories from carbs | -0.105 |
| Between GDP per person and calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 49289.50 | 37855.69 | 1.30 | 0.204 |
| cal carbs | -11.94 | 21.72 | -0.55 | 0.587 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.011 / -0.026 | |||
Indeed, the correlation is slightly stronger now (-0.105 against -0.083), but it is still very weak. Our parameter is once again not significant for explaining this new dependent variable either. But what if we changed our independent variable ? We see in the second graph above that this negative slope is even more accentuated when we take into account the proportion of the carbs consumed instead of their calorie count.
| Between GDP per person and proportion of calories from carbs | -0.575 |
| Between GDP per person and proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 163176.52 | 36974.37 | 4.41 | <0.001 |
| proportion carbs | -255576.19 | 69975.38 | -3.65 | 0.001 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.331 / 0.306 | |||
The correlation between these 2 variables is much stronger; we can even say that it is moderately strong. And the parameter estimate when we fit a linear model is quite significant. Let’s observe these relationships now within each cluster. According to our cluster plot, countries of clusters 1 and 3 (high GDP) consume a smaller proportion of carbs than those of cluster 2 (low GDP).
| Between GDP per person and proportion of calories from carbs | 0.553 |
| Between GDP per person and proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -28376.28 | 34846.14 | -0.81 | 0.436 |
| proportion carbs | 137918.44 | 69276.82 | 1.99 | 0.078 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.306 / 0.229 | |||
| Between GDP per person and proportion of calories from carbs | -0.751 |
| Between GDP per person and proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 66623.14 | 12383.47 | 5.38 | <0.001 |
| proportion carbs | -95782.01 | 22540.38 | -4.25 | 0.001 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.563 / 0.532 | |||
| Between GDP per person and proportion of calories from carbs | 1 |
| Between GDP per person and proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -678064.39 | NaN | NaN | NaN |
| proportion carbs | 1532052.98 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
Surprisingly, with these correlations we see that the previous logic is reversed within the high-GDP clusters. For richer countries like those in clusters 1 and 3, the link between GDP and carb consumption is positive within their respective clusters, and for less rich countries like those in cluster 2, it is negative. The estimate for the model parameter for cluster 2 is even very significant.
But overall, we cannot conclude that the link between the wealth of a country and its carb consumption is negative, as it is not strong enough.
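A sign change between pooled data and individual groups is a known effect when the groups differ in their average levels (often discussed under the name Simpson’s paradox). The Python sketch below uses invented numbers and hypothetical group names to show a pooled correlation that is negative even though the correlation inside each group is positive. The pattern is deliberately simpler than ours, where only the high-GDP clusters change sign.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def corr(rows):
    """Correlation between the two columns of a list of (x, y) pairs."""
    return pearson([p for p, _ in rows], [g for _, g in rows])

# Invented (proportion of carbs, GDP per person) pairs for two hypothetical groups.
rich = [(0.44, 40000), (0.46, 48000), (0.48, 45000)]
poor = [(0.54, 10000), (0.56, 18000), (0.58, 15000)]

r_rich = corr(rich)
r_poor = corr(poor)
r_pooled = corr(rich + poor)  # dominated by the gap between the two groups
print(r_rich, r_poor, r_pooled)
```

In this toy example the pooled coefficient mostly measures the difference between the two groups, not the relationship inside either of them, which is one reason to report both levels side by side.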
For the last part of our 3rd research question, let’s analyse the relationship of a country’s GDP and its fat consumption. Once again when we take a look at the graph at the EDA section, at the left side of the graph, there seems to be a slight increase. But the calories consumed from fat is more or less stable for countries with an average GDP of around 500 billion dollars and more, just like for the carbs. What does the correlation coefficient tell us ?
| Between average GDP and calories from fat | 0.503 |
| Between average GDP and calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -1941.46 | 827.39 | -2.35 | 0.027 |
| cal fat | 2.13 | 0.70 | 3.03 | 0.005 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.253 / 0.226 | |||
From the get-go, the correlation is moderately high and positive. The estimation of the model parameter is also quite significant. Does taking the GDP per capita change our results ? In the first graph above, the fat consumption seems to very slightly increase when a country’s GDP is higher. Let’s check.
| Between GDP per person and calories from fat | 0.636 |
| Between GDP per person and calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -46079.39 | 17640.70 | -2.61 | 0.015 |
| cal fat | 64.23 | 14.98 | 4.29 | <0.001 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.405 / 0.383 | |||
Our hypothesis was right. Not only did the correlation between the 2 variables increase, but the significance of the parameter estimation also improved. And how does switching the calorie count with the proportion of fat consumed per capita affect the correlation with the GDP per capita ? On the second graph above, we see that, just like for carbs, the positive slope becomes a bit more accentuated.
| Between GDP per person and proportion of calories from fat | 0.544 |
| Between GDP per person and proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -56316.18 | 25410.86 | -2.22 | 0.035 |
| proportion fat | 241607.12 | 71783.98 | 3.37 | 0.002 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.296 / 0.269 | |||
This time, the correlation and the significance of the parameter go down a little bit, even though not dramatically. Let’s see once more what these relationships are like within each cluster. By looking at the cluster plot, we can hypothesize that wealthier countries of clusters 1 and 3 consume a bigger proportion of fat than those of cluster 2 that are less wealthy.
| Between GDP per person and proportion of calories from fat | -0.695 |
| Between GDP per person and proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 101563.63 | 20955.14 | 4.85 | 0.001 |
| proportion fat | -162307.53 | 56002.68 | -2.90 | 0.018 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.483 / 0.425 | |||
| Between GDP per person and proportion of calories from fat | 0.659 |
| Between GDP per person and proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -13798.71 | 8617.08 | -1.60 | 0.132 |
| proportion fat | 84479.61 | 25779.67 | 3.28 | 0.006 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.434 / 0.394 | |||
| Between GDP per person and proportion of calories from fat | -1 |
| Between GDP per person and proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 378543.88 | NaN | NaN | NaN |
| proportion fat | -774350.31 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
Just like for carbs, surprisingly, we see again by looking at these correlations that the previous logic is reversed within the high-GDP clusters. For high GDP countries like those in clusters 1 and 3, the link between GDP and fat consumption is negative within their respective clusters, and for low GDP countries like those in cluster 2, it is positive. The estimate for the model parameter for cluster 2 is again quite significant. This suggests that among rich countries, the richest ones tend to consume a smaller proportion of fat, while among less rich countries the opposite holds.
In general, even though it seems positive, we cannot confidently say that there is a definite link between the wealth of a country and its fat consumption.
What we observed in the EDA seemed to not make sense to us as we were expecting a positive relationship between the total calories/calories consumed from fat and diabetes prevalence in a country. Let’s see now concretely if there is any correlation between the calories consumed from different macronutrients and the diabetes rate.
| | Animal protein | Plant protein | Carbs | Fat |
|---|---|---|---|---|
| Men | -0.246 | 0.227 | -0.032 | -0.404 |
| Women | -0.515 | 0.387 | 0.155 | -0.697 |
But we also saw in the previous question that taking into account the proportions rather than the calorie count for the macronutrients gave us better results. Let’s see if the correlation with diabetes prevalence improves too.
| | Animal protein | Plant protein | Carbs | Fat |
|---|---|---|---|---|
| Men | -0.141 | 0.452 | 0.272 | -0.303 |
| Women | -0.399 | 0.706 | 0.616 | -0.621 |
It does ! So to answer this question, we will rather work with the variables related to the proportions of the different macronutrients consumed than the exact calories consumed.
We will now look in more detail at the relationship of each macronutrient with diabetes prevalence, starting with animal protein.
In the EDA, we observed that diabetes prevalence tends to decrease as the consumption of animal protein increases. The graph below with the proportions gives us more or less the same result.
At the beginning of this question we saw that the correlation between these 2 variables is negative, but not very strong. This can be confirmed with the results of the linear regression below too.
| Diabetes rate for men vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.08 | 0.01 | 6.19 | <0.001 |
| proportion animal prot | -0.13 | 0.17 | -0.74 | 0.466 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.020 / -0.016 | |||
| Diabetes rate for women vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.08 | 0.01 | 5.89 | <0.001 |
| proportion animal prot | -0.45 | 0.20 | -2.26 | 0.032 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.159 / 0.128 | |||
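So there is a negative trend for both men and women, but we also see that the effect is stronger for women. The linear regression does not indicate any relationship for men, since the parameter is not significant (p = 0.466). For women the parameter is significant at the 5% level (p = 0.032), although the model explains only a small part of the variance (R2 = 0.159). Now let’s observe the relationship between diabetes and animal protein within each cluster.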
| Between men diabetes rate and proportion of calories from animal protein | -0.023 |
| Between women diabetes rate and proportion of calories from animal protein | -0.114 |
| Diabetes rate for men vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.06 | 0.02 | 3.66 | 0.005 |
| proportion animal prot | -0.02 | 0.22 | -0.07 | 0.947 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.001 / -0.111 | |||
| Diabetes rate for women vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.05 | 0.01 | 3.39 | 0.008 |
| proportion animal prot | -0.06 | 0.18 | -0.35 | 0.738 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.013 / -0.097 | |||
| Between men diabetes rate and proportion of calories from animal protein | 0.386 |
| Between women diabetes rate and proportion of calories from animal protein | -0.067 |
| Diabetes rate for men vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.06 | 0.01 | 5.27 | <0.001 |
| proportion animal prot | 0.25 | 0.16 | 1.57 | 0.140 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.149 / 0.088 | |||
| Diabetes rate for women vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.06 | 0.01 | 7.97 | <0.001 |
| proportion animal prot | -0.03 | 0.12 | -0.25 | 0.804 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.005 / -0.067 | |||
| Between men diabetes rate and proportion of calories from animal protein | 1 |
| Between women diabetes rate and proportion of calories from animal protein | 1 |
| Diabetes rate for men vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.05 | NaN | NaN | NaN |
| proportion animal prot | 1.42 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
| Diabetes rate for women vs proportion of calories from animal protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.02 | NaN | NaN | NaN |
| proportion animal prot | 0.72 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
For the first cluster, and for women in the second cluster, we have a negative relationship, which is in line with the cluster graph, even though the correlations are very weak. For men in the second cluster, the correlation is positive (0.386). It is also important to take into account that the regressions are not significant and it is therefore hard to draw anything from them.
Regarding the last cluster, we should expect a negative relationship but it seems to be positive. However, this cluster contains only 2 countries, so its correlation of 1 is not informative.
We saw in the EDA section that there seemed to be a trend between these 2 variables: diabetes prevalence seemed to increase with the calories consumed from plant protein. As we can see in the graph below, it seems that with proportions, this slope is even more accentuated.
We can check if this relationship is significant with a linear regression.
| Diabetes rate for men vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.03 | 0.01 | 2.49 | 0.019 |
| proportion plant prot | 0.69 | 0.26 | 2.63 | 0.014 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.204 / 0.175 | |||
| Diabetes rate for women vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.02 | 0.01 | -1.15 | 0.260 |
| proportion plant prot | 1.33 | 0.26 | 5.17 | <0.001 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.498 / 0.479 | |||
The parameter is significant for men (p = 0.014) and very significant for women (p < 0.001). With correlations of 0.452 for men and 0.706 for women, these relationships are also moderately strong. Is it also the case for the clusters ?
| Between men diabetes rate and proportion of calories from plant protein | 0.535 |
| Between women diabetes rate and proportion of calories from plant protein | 0.690 |
| Diabetes rate for men vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.02 | 0.02 | 1.10 | 0.299 |
| proportion plant prot | 0.81 | 0.42 | 1.90 | 0.090 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.287 / 0.207 | |||
| Diabetes rate for women vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.00 | 0.01 | 0.10 | 0.923 |
| proportion plant prot | 0.85 | 0.30 | 2.86 | 0.019 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.476 / 0.418 | |||
| Between men diabetes rate and proportion of calories from plant protein | -0.281 |
| Between women diabetes rate and proportion of calories from plant protein | 0.292 |
| Diabetes rate for men vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.09 | 0.02 | 5.67 | <0.001 |
| proportion plant prot | -0.33 | 0.30 | -1.10 | 0.292 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.079 / 0.013 | |||
| Diabetes rate for women vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.05 | 0.01 | 4.48 | 0.001 |
| proportion plant prot | 0.23 | 0.20 | 1.14 | 0.272 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.085 / 0.020 | |||
| Between men diabetes rate and proportion of calories from plant protein | 1 |
| Between women diabetes rate and proportion of calories from plant protein | 1 |
| Diabetes rate for men vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.31 | NaN | NaN | NaN |
| proportion plant prot | 9.02 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
| Diabetes rate for women vs proportion of calories from plant protein | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.15 | NaN | NaN | NaN |
| proportion plant prot | 4.60 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
For all of our clusters, according to the cluster graphs, we were expecting to see a positive relationship and our results mostly corresponded to those expectations. The correlations are quite strong for men and even higher for women in high GDP countries.
However in the second cluster, there is a negative correlation between the proportion of plant protein consumed and the diabetes prevalence in men, though not very strong.
In the EDA, the trend between carbohydrate consumption and diabetes seems rather flat. Our graph with proportions below, however, shows us a positive slope.
Indeed, the correlations between the proportion of carbs consumed and diabetes prevalence in men and women were positive, at 0.272 and 0.616 respectively. Let’s see the results when we try to fit a linear model.
| Diabetes rate for men vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.04 | 0.02 | 1.70 | 0.101 |
| proportion carbs | 0.06 | 0.04 | 1.47 | 0.154 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.074 / 0.039 | |||
| Diabetes rate for women vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.04 | 0.02 | -1.65 | 0.110 |
| proportion carbs | 0.17 | 0.04 | 4.06 | <0.001 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.380 / 0.357 | |||
We see that the parameter in the linear regression for women is very significant, which suggests that a higher proportion of carbohydrates in the diet is associated with a higher diabetes rate among women. What do our clusters indicate ?
| Between men diabetes rate and proportion of calories from carbs | -0.114 |
| Between women diabetes rate and proportion of calories from carbs | 0.125 |
| Diabetes rate for men vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.08 | 0.05 | 1.55 | 0.155 |
| proportion carbs | -0.03 | 0.10 | -0.34 | 0.738 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.013 / -0.097 | |||
| Diabetes rate for women vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.03 | 0.04 | 0.64 | 0.535 |
| proportion carbs | 0.03 | 0.08 | 0.38 | 0.715 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.016 / -0.094 | |||
| Between men diabetes rate and proportion of calories from carbs | -0.309 |
| Between women diabetes rate and proportion of calories from carbs | 0.486 |
| Diabetes rate for men vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.10 | 0.02 | 5.05 | <0.001 |
| proportion carbs | -0.04 | 0.04 | -1.21 | 0.245 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.095 / 0.031 | |||
| Diabetes rate for women vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.04 | 0.01 | 3.04 | 0.009 |
| proportion carbs | 0.05 | 0.02 | 2.08 | 0.057 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.236 / 0.181 | |||
| Between men diabetes rate and proportion of calories from carbs | 1 |
| Between women diabetes rate and proportion of calories from carbs | 1 |
| Diabetes rate for men vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.53 | NaN | NaN | NaN |
| proportion carbs | 1.20 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
| Diabetes rate for women vs proportion of calories from carbs | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | -0.27 | NaN | NaN | NaN |
| proportion carbs | 0.61 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
For all of our clusters, we generally have positive correlations. This is what was expected by observing the cluster plot. However, for clusters 1 and 2 the correlations are negative with regard to diabetes prevalence in men. These correlations are not strong at all, and the parameter estimates for the linear models we tried to fit over these variables are not significant.
But the positive trend between the consumption of carbs and diabetes prevalence is what makes the most sense to us so far, as foods that are rich in carbs, like fast food, tend to be bad for diabetes.
In the EDA part, fat showed a rather downward trend. Below, we have more or less the same graph, this time with proportions.
The correlation coefficients at the beginning of the question also showed a negative relationship: -0.303 and -0.621 for diabetes rates in men and women respectively, with regard to the proportion of fat consumed. Let’s see what happens when we try to fit linear models.
| Diabetes rate for men vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.09 | 0.01 | 6.40 | <0.001 |
| proportion fat | -0.07 | 0.04 | -1.65 | 0.110 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.092 / 0.058 | |||
| Diabetes rate for women vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.11 | 0.01 | 7.68 | <0.001 |
| proportion fat | -0.17 | 0.04 | -4.12 | <0.001 |
| Observations | 29 | |||
| R2 / R2 adjusted | 0.386 / 0.363 | |||
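As a consistency check on the two tables above: in a regression with a single predictor, R² equals the squared Pearson correlation coefficient, so the correlations quoted earlier (-0.303 and -0.621) should reproduce the reported R² values. A short Python sketch of that check, using only numbers already reported in this section:

```python
# (correlation coefficient, reported R2) taken from the text and tables above.
reported = {
    "men": (-0.303, 0.092),
    "women": (-0.621, 0.386),
}

# In simple linear regression, R2 is the square of the correlation.
squared = {group: round(r ** 2, 3) for group, (r, _) in reported.items()}

for group, (r, r2) in reported.items():
    print(f"{group}: r^2 = {squared[group]}, reported R2 = {r2}")
```

Both squared correlations match the tabulated R² to three decimals.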
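The fat coefficient is highly significant for women (p < 0.001), even though it corresponds to only a small decrease in the diabetes rate: an estimate of -0.17 means that a 10-percentage-point increase in the share of calories from fat is associated with a diabetes rate about 1.7 percentage points lower. Are these results replicated in the clusters?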
| Between men diabetes rate and proportion of calories from fat | 0.016 |
| Between women diabetes rate and proportion of calories from fat | -0.197 |
| Diabetes rate for men vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.06 | 0.04 | 1.68 | 0.128 |
| proportion fat | 0.00 | 0.09 | 0.05 | 0.962 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.000 / -0.111 | |||
| Diabetes rate for women vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.06 | 0.03 | 2.08 | 0.067 |
| proportion fat | -0.05 | 0.08 | -0.60 | 0.562 |
| Observations | 11 | |||
| R2 / R2 adjusted | 0.039 / -0.068 | |||
| Between men diabetes rate and proportion of calories from fat | 0.261 |
| Between women diabetes rate and proportion of calories from fat | -0.508 |
| Diabetes rate for men vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.06 | 0.01 | 5.19 | <0.001 |
| proportion fat | 0.04 | 0.04 | 1.01 | 0.329 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.068 / 0.001 | |||
| Diabetes rate for women vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.08 | 0.01 | 10.68 | <0.001 |
| proportion fat | -0.05 | 0.02 | -2.21 | 0.044 |
| Observations | 16 | |||
| R2 / R2 adjusted | 0.258 / 0.205 | |||
| Between men diabetes rate and proportion of calories from fat | -1 |
| Between women diabetes rate and proportion of calories from fat | -1 |
| Diabetes rate for men vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.30 | NaN | NaN | NaN |
| proportion fat | -0.61 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
| Diabetes rate for women vs proportion of calories from fat | ||||
|---|---|---|---|---|
| Predictors | Estimates | std. Error | Statistic | p |
| (Intercept) | 0.16 | NaN | NaN | NaN |
| proportion fat | -0.31 | NaN | NaN | NaN |
| Observations | 2 | |||
| R2 / R2 adjusted | 1.000 / NaN | |||
Judging from the cluster plot, there is a negative relationship between the proportion of fat consumed and diabetes prevalence within each cluster. Our results do not fully confirm this: the correlations with the diabetes rate in men are positive for clusters 1 and 2, although they are weak. Otherwise, the results are in line with our expectations, but the relationships are at most moderately strong. The regression results are also mostly not significant: only one fat coefficient, for women in the 16-country cluster, reaches the 5% level (p = 0.044), and the two-country cluster is too small to test anything.
Here, even though the results are not solid enough, they surprise us: fatty foods are not good for diabetes, yet our data shows that consuming more fat is not necessarily related to higher diabetes prevalence.
But then what does the typical diet of countries in each cluster look like? Do these overall compositions explain the differences in diabetes rates?
We now investigate whether calorie consumption patterns could be related to the diabetes rate in each cluster. To see these patterns, we take the average of every macronutrient in each cluster.
| | avg_gdp | gdp_per_person | prop_men_diabetes | prop_women_diabetes | cal_prot_animal | cal_prot_plant | cal_carbs | cal_fat | total_consumption |
|---|---|---|---|---|---|---|---|---|---|
| Cluster 1 | 1106 | 40938 | 0.061 | 0.042 | 262 | 164 | 1736 | 1292 | 3454 |
| Cluster 2 | 167 | 14181 | 0.074 | 0.062 | 215 | 172 | 1738 | 1049 | 3174 |
| Cluster 3 | 267 | 75791 | 0.059 | 0.035 | 263 | 143 | 1706 | 1355 | 3466 |
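Cluster averages like those in the table above can be obtained by grouping the country-level rows by cluster and taking the mean of each column. The sketch below is only an illustration in Python: the rows, country names and the helper `cluster_means` are hypothetical, and the values are made up rather than taken from our datasets.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical country rows: (country, cluster, cal_carbs, cal_fat).
# These values are invented for illustration only.
rows = [
    ("A", 1, 1700, 1300),
    ("B", 1, 1780, 1280),
    ("C", 2, 1740, 1050),
    ("D", 2, 1720, 1030),
]

def cluster_means(rows):
    """Average each macronutrient column within each cluster."""
    grouped = defaultdict(list)
    for _country, cluster, carbs, fat in rows:
        grouped[cluster].append((carbs, fat))
    return {
        cluster: {
            "cal_carbs": mean(v[0] for v in values),
            "cal_fat": mean(v[1] for v in values),
        }
        for cluster, values in grouped.items()
    }

means = cluster_means(rows)
for cluster, avg in sorted(means.items()):
    print(cluster, avg)
```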
As we see from the cluster averages, cluster 2 has the highest rate of diabetes, followed by cluster 1 and finally cluster 3, which has the lowest rate.
There is not much difference between the diets of our clusters. However, carbs represent 55% of the diet in cluster 2 (against about 50% in cluster 1 and 49% in cluster 3), and this macronutrient is strongly associated with the diabetes rate, particularly among women. Indeed, there is a 1% increase in the diabetes rate in women for every 0.168 carbs consumed on average per capita, significant at the 1% level. Bringing the proportion of carbs below half of total calories might therefore be associated with a lower diabetes rate.
Strangely enough, the proportion of fat is also lower in cluster 2 (33%), which is counter-intuitive as its average diabetes rates are higher. This result is nevertheless consistent with the negative correlations between fat and the diabetes rate calculated above. It is therefore difficult to draw a conclusion about the proportion of fat that should be consumed to reduce the diabetes rate. Our hypothesis is that even if cluster 1 consumes more fat, it is composed of rich countries that have more prevention measures and probably better hospital infrastructure.
It looks like richer European countries, like those of clusters 1 and 3, tend to consume a bit more animal protein and fat, while poorer countries, like those of cluster 2, get a larger share of their calories from carbs.
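The percentages quoted in this comparison can be recomputed from the cluster averages table. The Python sketch below uses the four calorie columns of that table; the helper name `calorie_shares` is ours, and the total is taken as the sum of the four components (which differs from `total_consumption` by one calorie for cluster 3 because of rounding).

```python
# Average calories per macronutrient, copied from the cluster table.
clusters = {
    "Cluster 1": {"cal_prot_animal": 262, "cal_prot_plant": 164,
                  "cal_carbs": 1736, "cal_fat": 1292},
    "Cluster 2": {"cal_prot_animal": 215, "cal_prot_plant": 172,
                  "cal_carbs": 1738, "cal_fat": 1049},
    "Cluster 3": {"cal_prot_animal": 263, "cal_prot_plant": 143,
                  "cal_carbs": 1706, "cal_fat": 1355},
}

def calorie_shares(diet):
    """Return each macronutrient as a percentage of the summed calories."""
    total = sum(diet.values())
    return {name: 100 * kcal / total for name, kcal in diet.items()}

for label, diet in clusters.items():
    shares = calorie_shares(diet)
    print(label, {name: round(pct, 1) for name, pct in shares.items()})
```

The absolute carb calories are almost identical across clusters; the difference in shares comes mainly from the lower fat and total intake in cluster 2.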
With this report, we tried to find out whether there were any links between European countries’ wealth, their diabetes rates and their food consumption. Our end goal was to find out whether a typical diet could be attributed to a certain GDP level and therefore affected the population’s diabetes prevalence, in order to provide an example for governments when they make decisions on food supply with public health in mind. Even though we used multiple methods and metrics (correlation coefficients, linear models and cluster analysis) to answer our research questions, we could not find many meaningful and strong relationships between our variables.
Our control question on the negative correlation between a country’s GDP and its diabetes prevalence was confirmed with our data without outliers. The general analysis was enough to answer our questions and going deeper within clusters didn’t help us find better results.
Moreover, we hypothesized that, since the GDP and the diabetes rate of a country are negatively correlated, a country with a higher GDP would consume fewer calories. To our surprise, our data showed the opposite, although the results were not significant. This result could still make sense when we consider that a higher GDP means access to more resources such as food, and therefore higher consumption by the population.
As for the relationship between the GDP and the consumption of macronutrients, we found that richer countries in Europe tend to consume more animal protein and fat (from foods like beef, salmon and avocados, which tend to be more expensive), whereas poorer countries consume more plant protein and carbohydrates (from foods like lentils, potatoes and beans, which are cheaper).
Then we checked the relationship between diabetes prevalence in a country and the consumption of the different macronutrients. Once again contrary to our hypothesis, our results suggest that a higher consumption of animal protein could be related to a lower rate of diabetes, which would contradict M. Adeva-Andany’s (2019) article “Dietary habits contribute to define the risk of type 2 diabetes in humans”. The consumption of fat was also negatively correlated with diabetes prevalence, whereas the opposite held for plant protein and carbs. The latter makes more sense to us. However, the effects of all macronutrients except protein on diabetes were highly significant for women; we therefore assume that malnutrition has a greater impact on diabetes in women than in men.
High-GDP countries in Europe consume more fat and animal protein but have a lower diabetes rate, and our results suggest that consuming more fat does not equal higher diabetes. One hypothesis that could explain this phenomenon is that richer countries do more diabetes prevention and have better hospital infrastructure and medical care. Another hypothesis is that the rate of diabetes is more related to the level of exercise in a country.
The results of the analysis within clusters often did not make sense or were not significant or strong enough. We did not have many countries to work with, yet we could have tried to increase the number of clusters. However, we would then risk having even smaller samples and fewer observations, which is not good for statistical accuracy.
Another limitation is that the linear regressions were not always significant, and it was difficult to find the right relationship between the variables, if there was any. Nutrition and health are complex, and it is logical that a simple linear regression with one variable is not enough to explain another.
To go further, it would be interesting to take into account countries outside of Europe, which are more varied, in order to obtain more distinct clusters. It would be interesting, for example, to know more about the United States, a country with a high GDP and a high rate of diabetes.
Another interesting thing would be to calculate a ratio of the consumption of carbs to fat and see if this index could give us more explanations on the diabetes prevalence in different countries.
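As a first rough look at such an index, the carbs-to-fat ratio can already be computed from the three cluster averages reported earlier. This is only a Python sketch on aggregated values: three data points are far too few to draw any conclusion, and the name `carb_fat_ratio` is ours.

```python
# Cluster averages copied from the cluster table (calories and women's rate).
table = {
    "Cluster 1": {"cal_carbs": 1736, "cal_fat": 1292, "prop_women_diabetes": 0.042},
    "Cluster 2": {"cal_carbs": 1738, "cal_fat": 1049, "prop_women_diabetes": 0.062},
    "Cluster 3": {"cal_carbs": 1706, "cal_fat": 1355, "prop_women_diabetes": 0.035},
}

def carb_fat_ratio(row):
    """Calories from carbs divided by calories from fat."""
    return row["cal_carbs"] / row["cal_fat"]

# Rank clusters from the highest to the lowest carbs-to-fat ratio.
ranked = sorted(table, key=lambda c: carb_fat_ratio(table[c]), reverse=True)

for label in ranked:
    row = table[label]
    print(f"{label}: ratio = {carb_fat_ratio(row):.2f}, "
          f"women diabetes rate = {row['prop_women_diabetes']}")
```

On these averages, the ordering by ratio coincides with the ordering by diabetes rate, which suggests the index would be worth testing at the country level.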